Members: Joris Voogt, Xiaoyu Mao, Yuwen Peng
Student numbers: 4295978, 5947642, 5914507
Joris Voogt: - Created the choropleth map for Europe to compare passenger km relative to the GDP.
- Created the bar charts to show the percentage of road and rail passenger km relative to the total passenger km.
- Created Part IV - Streamlit.
- Visualized, modelled, and analysed this data as seen on the [Streamlit application](ttps://til6022-project-l3z3ycqjbpwyyp5vvudwlw.streamlit.app/).
Xiaoyu Mao: background research, country grouping in data processing, optimization and improvement on code
Yuwen Peng: data collection, correlation analysis, visualisation, data visualisation
Requires data modeling and quantitative research in Transport, Infrastructure & Logistics
RQ:
This research will focus on passenger transport trends and safety issues in Europe. We will first analyse the relationship between transport mode choice and the economy in different countries. The GDP is used to represent the economic development of the country, on the basis of which we will analyse the changes in the scale of road and rail passenger transport in European countries in recent years. By analyzing the volume of rail passenger and road passenger transport , the proportion of both in the overall volume of transport can be told. Then discuss whether there are countries who are highly dependent on road transport, and then we select the countries who have a large proportion of road transport to analyze the relationship between GDP and road safety.
Time scale: Data is a bit limited for passenger transport, so most likely in the range of 2016-2021.
Geographical boundary: Countries in Europe. Not all of them have data available in the datasets, so countries with most data available will be analysed.
The volume of passenger transport relative to GDP can give the relative scale of passenger transport in the national economy. Visualise it by creating an interactive map of Europe with different colors for each country based on the data, representing levels of the volume of passenger transport relative to GDP for different transport modes.
Analyse the changes in the scale of road and rail passenger transport in European countries in recent years. By analysing the volume of rail passenger transport and road passenger transport in each country and drawing a scale diagram, we can find out the proportion of the two in the overall volume of transport; and drawing a line graph can show the changes in recent years.
Further explore the impact of transport investment on transport safety. Discuss whether there are countries in 2 that are highly dependent on road transport, and then select countries with a large share of road transport to analyse the relationship between GDP and road safety.
Look for other possible factors that could be related to traffic mortality rate. Such as proportions of motorways, GDP per capita, alcohol consumption. This data is likely leading to new relationships and possibly correlates. Therefore, a correlation matrix is used, measuring strength and quickly giving an overview of all the numerical correlations.
Summarize the research and propose ideas about transport patterns in the future about road safety.
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
First step is to create new dataframes, clean the useless columns and rows from the original data, then rename the data columns and rows to be more clearly for the following data:
Start off with the volume of passenger transport relative to GDP.
# Import Volume of passenger transport relative to GDP: Read the csv file as a dataframe
file_path_pg_GDP = 'data/pg_GDP.csv'
df_pg_GDP = pd.read_csv(file_path_pg_GDP)
# Rename columns and rows
df_pg_GDP.columns = df_pg_GDP.iloc[6].tolist()
df_pg_GDP = df_pg_GDP.rename(columns={df_pg_GDP.columns[0]: 'GEO'})
# Clean the data by dropping columns of the dataframe that will not be used
# These rows are used to indicate the data, and columns are used to explain the accuracy of the data
df_pg_GDP = df_pg_GDP.iloc[8:48]
columns_to_remove = [2,4,6,8,10,12,14,16,18,20,22,24]
df_pg_GDP = df_pg_GDP.drop(df_pg_GDP.columns[columns_to_remove], axis=1 )
columns_to_remove = [7]
df_pg_GDP = df_pg_GDP.drop(df_pg_GDP.columns[columns_to_remove], axis=1 )
# Delete the rows without data and use Euro area or country as the row name
df_pg_GDP = df_pg_GDP.set_index('GEO')
df_pg_GDP = df_pg_GDP[~(df_pg_GDP == ':').all(axis=1)]
df_pg_GDP.head() # disaplay the first 5 rows of dataframe
| 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GEO | ||||||||||||
| European Union - 27 countries (from 2020) | 103.7 | 102.2 | 100.9 | 100.8 | 100.1 | 100 | 100 | 98.3 | 96.6 | 95.4 | 79.2 | 80.2 |
| Belgium | 114.1 | 110.5 | 114 | 111.5 | 103.7 | 100 | 97.4 | 96.6 | 95.5 | 93.8 | 80.9 | 79.9 |
| Bulgaria | 91.5 | 91.2 | 91.9 | 94.4 | 99.1 | 100 | 95.9 | 93.9 | 89.7 | 88.7 | 75.4 | 72.3 |
| Czechia | 101.2 | 100 | 99.6 | 100.7 | 103.1 | 100 | 100.7 | 99.2 | 100.9 | 101.5 | 83.8 | 99.2 |
| Denmark | 98.7 | 99.5 | 98.7 | 97.7 | 98.1 | 100 | 99.8 | 98.6 | 97.2 | 98.2 | 88.4 | 85.9 |
The next is the rail and road passengers of each country.
# Import Rail transport of passengers: Read the csv file as a dataframe
file_path_rail_pg = 'data/rail_pg.csv'
df_rail_pg = pd.read_csv(file_path_rail_pg)
# Rename columns and rows
df_rail_pg.columns = df_rail_pg.iloc[7].tolist()
df_rail_pg = df_rail_pg.rename(columns={df_rail_pg.columns[0]: 'GEO'})
# Clean the data by dropping columns of the dataframe that will not be used
df_rail_pg = df_rail_pg.iloc[9:49]
columns_to_remove = [2,4,6,8,10,12,14,16,18,20,22,24]
df_rail_pg = df_rail_pg.drop( df_rail_pg.columns[columns_to_remove], axis=1 )
# Delete the rows without data and reset the index
df_rail_pg = df_rail_pg[~(df_rail_pg == ':').all(axis=1)]
df_rail_pg = df_rail_pg.reset_index(drop = True)
df_rail_pg.head()
| GEO | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | European Union - 27 countries (from 2020) | : | : | : | : | 375,713 | 384,335 | 394,142 | 400,427 | 413,923 | 223,668 | 260,715 | : |
| 1 | Euro area – 20 countries (from 2023) | : | : | : | : | : | : | : | : | : | : | : | : |
| 2 | Euro area - 19 countries (2015-2022) | : | : | : | : | : | : | : | : | : | : | : | : |
| 3 | Belgium | 10,498 | : | : | : | : | : | : | : | : | : | : | : |
| 4 | Bulgaria | 2,059 | 1,870 | 1,821 | 1,698 | 1,549 | 1,455 | 1,434 | 1,476 | 1,520 | 1,118 | 1,203 | 1,600 |
# Import Rail transport of passengers: Read the csv file as a dataframe
file_path_road_pg = 'data/road_pg.csv'
df_road_pg = pd.read_csv(file_path_road_pg)
# Rename columns and rows
df_road_pg.columns = df_road_pg.iloc[7].tolist()
df_road_pg = df_road_pg.rename(columns={df_road_pg.columns[0]: 'GEO'})
# Clean the data by dropping columns of the dataframe that will not be used
# After observation, it was found that they are all sequence numbers with double digits
# Remove double numbered columns starting from the fourth column with index 3
df_road_pg = df_road_pg.iloc[9:37]
columns_to_keep = df_road_pg.columns[:1].to_list()
columns_to_remove = df_road_pg.columns[1::2]
df_road_pg = df_road_pg[columns_to_keep + columns_to_remove.to_list()]
# Delete the rows without data and reset the index
df_road_pg = df_road_pg[~(df_road_pg == ':').all(axis=1)]
df_road_pg = df_road_pg.reset_index(drop = True)
df_road_pg.head()
| GEO | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Belgium | 135,070 | 132,615 | : | : | : | : | : | : | : | : |
| 1 | Bulgaria | 8,949.9 | 8,666.600 | 9,919.700 | 10,017.800 | 9,548.600 | 8,916.900 | 8,017.500 | 8,952.700 | 4,803 | 4,611.600 |
| 2 | Czechia | 78,719 | 79,382 | : | 84,829 | 87,257 | 90,049 | 94,193 | 97,320 | 77,771 | 96,722 |
| 3 | Denmark | : | 66,794 | 68,140 | 70,962 | 73,393 | 74,468 | 74,934 | : | 69,742 | 70,835 |
| 4 | Germany | 990,582 | 998,566 | 1,013,748 | 1,027,500 | 1,048,003 | 1,015,350 | 1,013,700 | 996,500 | 866,646 | 846,687 |
Next is annual road investment and maintenace of each country. The data cover the amount of money spent annually measured in euros.Later on, we will focus on analyzing their relationship with road safety.
# Import Road Infrustructure Investment Spending: Read the csv file as a dataframe
file_path_road_inv = 'data/road_inv.csv'
df_road_inv = pd.read_csv(file_path_road_inv)
# First clean the data by dropping columns of the dataframe that will not be used
df_road_inv = df_road_inv.iloc[4:56]
# Remove double numbered columns starting from the fourth column with index 3
columns_to_keep = df_road_inv.columns[:1].to_list()
columns_to_remove = df_road_inv.columns[3::2]
df_road_inv = df_road_inv[columns_to_keep + columns_to_remove.to_list()]
# Rename columns and rows
new_column_names = [str(year) for year in range(1999, 2022)]
df_road_inv.columns = new_column_names
df_road_inv = df_road_inv.rename(columns={df_road_inv.columns[0]: 'GEO'})
# Delete the rows without data and reset the index
df_road_inv = df_road_inv[~(df_road_inv == '..').all(axis=1)]
df_road_inv = df_road_inv.reset_index(drop = True)
df_road_inv.head()
| GEO | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 108,347,272 | 107,567,894 | 64,458,009 | 60,782,802 | 91,078,262 | 68,193,896 | 175,223,508 | 253,261,449 | 499,989,410 | ... | 180,820,940 | 234,163,844 | 192,718,553 | 179,236,201 | 89,140,344 | 159,032,880 | 169,252,490 | 175,879,111 | 223,292,017 | 265,726,260 |
| 1 | Australia | 3,985,384,644 | 3,232,914,774 | 3,174,141,910 | 3,809,467,524 | 4,673,112,063 | 6,047,684,590 | 6,696,458,225 | 7,813,640,359 | 7,904,953,310 | ... | 15,649,722,083 | 13,990,189,647 | 10,988,504,646 | 10,310,609,857 | 10,841,616,558 | 12,673,185,564 | 14,363,912,675 | 13,317,666,734 | 12,671,465,384 | 13,286,823,655 |
| 2 | Austria | 475,000,000 | 640,000,000 | 532,000,000 | 650,000,000 | 720,000,000 | 687,000,000 | 802,000,000 | 870,000,000 | 875,000,000 | ... | 327,000,000 | 363,000,000 | 453,000,000 | 455,000,000 | 444,000,000 | 515,000,000 | 463,000,000 | 562,000,000 | 548,000,000 | 480,000,000 |
| 3 | Azerbaijan | 30,255,355 | 47,961,631 | 46,789,989 | 34,212,659 | 48,289,409 | 82,349,945 | 260,388,800 | 374,213,302 | 1,328,480,437 | ... | 1,479,179,001 | 1,913,627,639 | 1,411,301,928 | 873,204,688 | 498,158,118 | 557,071,513 | 695,011,451 | 844,323,018 | 757,138,439 | 1,208,250,497 |
| 4 | Belarus | .. | .. | .. | .. | .. | 537,113,018 | 601,559,599 | 781,101,292 | 933,142,311 | ... | 1,581,319,399 | 1,446,139,503 | 1,357,658,853 | 1,007,813,385 | 872,967,572 | 960,754,682 | 1,004,655,803 | 1,157,793,622 | .. | .. |
5 rows × 23 columns
# Import Road Infrustructure Maintenance Spending: Read the csv file as a dataframe
file_path_road_maint = 'data/road_maint.csv'
df_road_maint = pd.read_csv(file_path_road_maint)
# Rename columns and rows
df_road_maint.columns = df_road_maint.iloc[2].tolist()
df_road_maint = df_road_maint.rename(columns={df_road_maint.columns[0]: 'GEO'})
# Clean the data by dropping columns of the unit that is meaningless when the selected country are all from Europe
# Then remove the rows from original data for explaination
df_road_maint = df_road_maint.iloc[4:46]
columns_remove = [1]
df_road_maint = df_road_maint.drop( df_road_maint.columns[columns_remove], axis=1 )
# Delete the rows without data and reset the index
df_road_maint = df_road_maint[~(df_road_maint == '..').all(axis=1)]
df_road_maint = df_road_maint.reset_index(drop = True)
df_road_maint.head()
| GEO | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 3,838,572 | 5,705,714 | 6,824,432 | 5,767,021 | 6,829,499 | 7,110,073 | 5,670,979 | 6,046,008 | 8,056,200 | ... | 6,652,853 | 8,745,595 | 15,262,572 | 8,374,520 | 13,004,791 | 13,593,700 | 13,444,559 | 13,050,669 | 9,058,683 | 21,515,139 |
| 1 | Armenia | .. | .. | .. | .. | .. | 8,770,105 | 9,588,976 | 10,683,530 | 11,111,985 | ... | 10,706,482 | 10,085,012 | 10,050,491 | 11,236,616 | 11,586,347 | .. | .. | .. | .. | .. |
| 2 | Austria | 508,000,000 | 520,000,000 | 294,000,000 | 388,000,000 | 458,000,000 | 443,000,000 | 495,000,000 | 486,000,000 | 467,000,000 | ... | 517,000,000 | 559,000,000 | 667,000,000 | 692,000,000 | 697,000,000 | 687,000,000 | 726,000,000 | 752,000,000 | 791,000,000 | 899,000,000 |
| 3 | Azerbaijan | .. | 15,587,530 | 15,233,950 | 17,106,329 | 34,375,512 | 33,109,772 | 54,396,290 | 31,467,937 | 34,742,328 | ... | 34,537,201 | 31,669,866 | 31,660,750 | 22,909,507 | 18,702,182 | 27,802,090 | 25,390,819 | 32,049,598 | 52,056,489 | 46,719,682 |
| 4 | Belgium | 59,000,000 | 62,000,000 | 68,000,000 | 73,000,000 | 77,000,000 | 80,000,000 | 104,000,000 | 94,000,000 | 102,000,000 | ... | 145,000,000 | 147,000,000 | 206,000,000 | 457,000,000 | 528,000,000 | 396,797,000 | 216,000,000 | 213,000,000 | 155,000,000 | 498,000,000 |
5 rows × 23 columns
Then there is the inport of annual fatalities and casualties of each country.
# Import Road injury crashes, fatalities and injuries: Read the csv file as a dataframe
df_inj_kill = pd.read_csv('data/inj_kill.csv')
# First clean the data by dropping columns of the dataframe that will not be used
df_inj_kill = df_inj_kill.iloc[4:60]
# Remove double numbered columns starting from the fourth column with index 3
columns_to_keep = df_inj_kill.columns[:1].to_list()
columns_to_remove = df_inj_kill.columns[3::2]
df_inj_kill = df_inj_kill[columns_to_keep + columns_to_remove.to_list()]
# Rename columns and rows
new_column_names = [str(year) for year in range(1999, 2022)]
df_inj_kill.columns = new_column_names
df_inj_kill = df_inj_kill.rename(columns={df_inj_kill.columns[0]: 'GEO'})
# Reset the index
df_inj_kill = df_inj_kill.reset_index(drop = True)
df_inj_kill.head()
| GEO | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 620 | 547 | 478 | 510 | 1,110 | 1,180 | 1,342 | 1,728 | 1,554 | ... | 2,569 | 2,798 | 2,617 | 2,692 | 2,779 | 2,611 | 2,291 | 2,044 | 1,598 | 1,860 |
| 1 | Argentina | .. | .. | .. | .. | .. | .. | .. | .. | .. | ... | 122,062 | 118,925 | 85,984 | 96,564 | 107,852 | 118,593 | 109,594 | 112,065 | .. | .. |
| 2 | Armenia | 1,377 | 1,495 | 1,448 | 1,546 | 1,751 | 2,084 | 2,421 | 3,091 | 3,532 | ... | 4,050 | 4,310 | 4,776 | 5,084 | 4,718 | 5,458 | 5,234 | 5,103 | .. | .. |
| 3 | Australia | 28,591 | 29,066 | 29,483 | 29,891 | 30,330 | 32,069 | 33,740 | 34,005 | 34,839 | ... | 35,367 | 36,322 | 36,703 | 38,286 | 40,237 | 40,553 | 40,732 | 40,649 | .. | .. |
| 4 | Austria | 55,905 | 57,223 | 57,640 | 57,812 | 56,735 | 54,002 | 52,660 | 53,902 | 51,200 | ... | 51,426 | 48,499 | 48,100 | 47,845 | 48,825 | 47,672 | 46,934 | 45,556 | 38,074 | 41,251 |
5 rows × 23 columns
# Import Road fatalities per one hundrend thousand inhabitants: Read the csv file as a dataframe
file_path_fat_inh = 'data/fat_inh.csv'
df_fat_inh = pd.read_csv(file_path_fat_inh)
# First clean the data by dropping columns of the dataframe that will not be used
# These rows are used to indicate the data, and columns are used to explain the accuracy of the data
df_fat_inh = df_fat_inh.iloc[3:62]
# Remove double numbered columns starting from the fourth column with index 3
columns_to_keep = df_fat_inh.columns[:1].to_list()
columns_to_remove = df_fat_inh.columns[2::2]
df_fat_inh = df_fat_inh[columns_to_keep + columns_to_remove.to_list()]
# Rename columns and rows
new_column_names = [str(year) for year in range(1999, 2022)]
df_fat_inh.columns = new_column_names
df_fat_inh = df_fat_inh.rename(columns={df_fat_inh.columns[0]: 'GEO'})
# Reset the index
df_fat_inh = df_fat_inh.reset_index(drop = True)
df_fat_inh.head()
| GEO | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | ... | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 9.1 | 9.7 | 8.2 | 8.6 | 10.4 | 10.2 | 9.3 | 12.9 | 10.3 | ... | 11.5 | 10.2 | 9.1 | 9.4 | 9.4 | 7.7 | 7.4 | 8.0 | 6.4 | 7 |
| 1 | Argentina | .. | .. | .. | .. | .. | .. | .. | .. | 14.3 | ... | 12.2 | 12.3 | 12.4 | 12.6 | 12.8 | 12.3 | 12.4 | 12.5 | .. | .. |
| 2 | Armenia | 6.8 | 7.6 | 7.6 | 8.2 | 8.4 | 10.2 | 11 | 12.3 | 13.6 | ... | 10.7 | 10.9 | 10.3 | 12.0 | 9.3 | 9.8 | 9.9 | 9.1 | .. | .. |
| 3 | Australia | 8.6 | 8.2 | 7.8 | 7.3 | 7.2 | 7.3 | 7.1 | 7 | 6.2 | ... | 5.6 | 5.5 | 4.9 | 5.1 | 5.3 | 5.0 | 4.5 | 4.7 | 4.3 | 4.3 |
| 4 | Austria | 12.2 | 11.9 | 11.8 | 11.5 | 10.7 | 9.3 | 8.8 | 8.3 | 8.2 | ... | 6.3 | 5.4 | 5.0 | 5.5 | 4.9 | 4.7 | 4.6 | 4.7 | 3.9 | 4 |
5 rows × 23 columns
# Import GDP per capita: Read the csv file as a dataframe
file_path_GDP = 'data/GDP.csv'
df_GDP = pd.read_csv(file_path_GDP)
# Remove the last row and use European region or country as the row name
df_GDP = df_GDP.iloc[:58]
df_GDP = df_GDP.rename(columns={df_GDP.columns[0]: 'GEO'})
#df_GDP = df_GDP.set_index('GEO')
df_GDP.head()
| GEO | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | ... | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 1,126.70 | 1,281.70 | 1,425.10 | 1,846.10 | 2,373.60 | 2,673.80 | 2,972.70 | 3,595.00 | 4,370.50 | ... | 4,413.10 | 4,578.60 | 3,952.80 | 4,124.10 | 4,531.00 | 5,287.70 | 5,396.20 | 5,343.00 | 6,377.20 | 6,802.80 |
| 1 | Andorra | 21,620.50 | 22,809.10 | 24,785.00 | 31,954.20 | 37,630.80 | 39,583.90 | 43,084.30 | 50,562.90 | 53,721.40 | ... | 44,747.80 | 45,680.50 | 38,885.50 | 39,931.20 | 40,632.20 | 42,904.80 | 41,328.60 | 37,207.20 | 42,072.30 | 41,992.80 |
| 2 | Armenia | 603.3 | 676.2 | 765.3 | 910.2 | 1,166.60 | 1,608.20 | 2,109.50 | 3,064.30 | 3,908.90 | ... | 3,833.20 | 4,017.20 | 3,666.10 | 3,680.00 | 4,042.00 | 4,391.90 | 4,828.50 | 4,505.90 | 4,966.50 | 7,014.20 |
| 3 | Austria | 24,625.60 | 24,558.80 | 26,527.60 | 32,294.00 | 36,889.20 | 38,417.50 | 40,669.30 | 46,915.30 | 51,920.00 | ... | 50,731.10 | 51,786.40 | 44,195.80 | 45,307.60 | 47,429.20 | 51,466.60 | 50,070.40 | 48,809.20 | 53,637.70 | 52,131.40 |
| 4 | Azerbaijan | 655.1 | 703.7 | 763.1 | 883.7 | 1,045.00 | 1,578.40 | 2,473.00 | 3,851.30 | 5,574.50 | ... | 7,875.30 | 7,890.80 | 5,500.50 | 3,880.70 | 4,147.20 | 4,739.80 | 4,805.80 | 4,229.90 | 5,408.00 | 7,736.70 |
5 rows × 24 columns
# Import all the European countries names
countries_iso = pd.read_csv('data/countries_iso.csv')
countries = countries_iso.iloc[:, 1].tolist()
# Import Proportion of motorways over the total road network(%),2010-2021
df_mot_rate = pd.read_csv('data/mot_rate.csv')
df_mot_rate = df_mot_rate.iloc[:,:11]
df_mot_rate.head()
# Import Motorization index(in cars per 1000 inhabitants),2000-2021
df_mot_index = pd.read_csv('data/mot_index.csv')
df_mot_index = df_mot_index.iloc[:,0:1].join(df_mot_index.iloc[:,11:21], how='outer')
df_mot_index.head()
# Import Unemployment rate(%),2010-2022
df_unempl_rate = pd.read_csv('data/unempl_rate.csv')
df_unempl_rate = df_unempl_rate.iloc[:,:11]
df_unempl_rate.head()
# Import alcohol consumption(%),2010-2021
df_alcohol = pd.read_csv('data/alcohol.csv')
df_alcohol = df_alcohol.iloc[:,0:1].join(df_alcohol.iloc[:,11:], how='outer')
df_alcohol.head()
# Reselect the range of other five data sources for casualties
fat_inh = df_fat_inh.iloc[:,0:1].join(df_fat_inh.iloc[:,11:21], how='outer').replace('..',np.nan)
inj_kill = df_inj_kill.iloc[:,0:1].join(df_inj_kill.iloc[:,11:21], how='outer').replace('..',np.nan)
road_inv = df_road_inv.iloc[:,0:1].join(df_road_inv.iloc[:,11:21], how='outer').replace('..',np.nan)
road_maint = df_road_maint.iloc[:,0:1].join(df_road_maint.iloc[:,11:21], how='outer').replace('..',np.nan)
GDP = df_GDP.iloc[:,0:1].join(df_GDP.iloc[:,11:21], how='outer').replace('..',np.nan)
df_alcohol.head()
| GEO | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Albania | 4.88 | 5.03 | 4.43 | 4.28 | 4.40 | 4.33 | 4.38 | 4.39 | 4.44 | 4.40 |
| 1 | Andorra | 10.53 | 10.46 | 10.33 | 10.06 | 9.88 | 11.02 | 11.18 | 11.18 | 10.72 | 10.99 |
| 2 | Armenia | 4.23 | 4.07 | 3.89 | 3.92 | 4.22 | 4.04 | 3.83 | 3.83 | 3.69 | 3.77 |
| 3 | Austria | 12.10 | 11.90 | 12.10 | 12.10 | 12.40 | 11.60 | 11.70 | 11.60 | 11.60 | 11.90 |
| 4 | Azerbaijan | 0.99 | 0.60 | 0.52 | 0.52 | 0.36 | 0.51 | 0.39 | 0.44 | 0.88 | 1.38 |
Divide countries based upon GDP:
# Divided the coutries into two groups: low-income and high-income countries, basing on GDP per captia
# Select the last five years as evaluation criteria
selected_years = ['2018','2019','2020','2021','2022']
data_selected_years = df_GDP[selected_years]
# Replace invalid values in the dataframe with NAN
data_selected_years = data_selected_years.replace('..', np.nan)
# Convert numerical values to floats
data_selected_years = data_selected_years.replace(',', '', regex=True).astype(float)
# Calculate the average GDP of each country
data_average_GDP = data_selected_years.mean(axis=1)
df_GDP['average GDP'] = data_average_GDP
# Calculate the average GDP of all countries
average_GDP = df_GDP['average GDP'].mean()
print(f'The average GDP of all the countries in last five years is {average_GDP}')
# Add a new column to mark their level of income: high or low
df_GDP['Income Group'] = ''
df_GDP.loc[df_GDP['average GDP'] > average_GDP,'Income Group'] = 'High Income'
df_GDP.loc[df_GDP['average GDP'] < average_GDP,'Income Group'] = 'Low Income'
# Create lists to devide the two groups of countries
High_inc = df_GDP[df_GDP['Income Group'] == 'High Income']['GEO'].to_list()
Low_inc = df_GDP[df_GDP['Income Group'] == 'Low Income']['GEO'].to_list()
# Print two lists of countries one by one
if True:
print('\033[1mHigh Income countries includes:\033[0m')
for country in High_inc:
print(country, end = ',')
print('\n\033[1mLow Income countries includes:\033[0m')
for country in Low_inc:
print(country, end = ',')
The average GDP of all the countries in last five years is 36946.38868421052 High Income countries includes: Andorra,Austria,Belgium,Channel Islands,Denmark,Faroe Islands,Finland,France,Germany,Greenland,Iceland,Ireland,Isle of Man,Liechtenstein,Luxembourg,Monaco,Netherlands,Norway,San Marino,Sweden,Switzerland,United Kingdom, Low Income countries includes: Albania,Armenia,Azerbaijan,Belarus,Bosnia and Herzegovina,Bulgaria,Croatia,Cyprus,Czechia,Estonia,Georgia,Greece,Hungary,Italy,Kazakhstan,Kosovo,Kyrgyz Republic,Latvia,Lithuania,Moldova,Montenegro,North Macedonia,Poland,Portugal,Romania,Russian Federation,Serbia,Slovak Republic,Slovenia,Spain,Tajikistan,Turkiye,Turkmenistan,Ukraine,Uzbekistan,
# The years we choose to analyze (based upon availability)
chosen_years = [str(i) for i in range(2013, 2022)]
# Keep the country column
chosen_years.insert(0, 'GEO')
# The countries we choose to analyze (based upon availability)
chosen_countries = ('Austria','Bulgaria', 'Denmark', 'Germany', 'France', 'Finland','Czechia', 'Croatia', 'Italy', 'Lithuania', 'Poland', 'Slovakia')
# Keep only the chosen years and countries in the road and rail passenger km data sets
df_road = df_road_pg[df_road_pg['GEO'].isin(chosen_countries)][chosen_years]
df_road.reset_index(drop=True, inplace=True)
df_rail = df_rail_pg[df_rail_pg['GEO'].isin(chosen_countries)][chosen_years]
df_rail.reset_index(drop=True, inplace=True)
# Combine road and rail data
df_psk = pd.DataFrame(columns=['Year', 'Country', 'Rail_psk', 'Road_psk', 'Total_psk', 'Percentage road', 'Percentage rail'])
for row in range(0, len(df_road.index)):
for col in range(1, len(df_road.columns)):
# Remove commas and change : to NaN
rail_val = float(df_rail.iloc[row, col].replace(',', '').replace(':', 'NaN'))
road_val = float(df_road.iloc[row, col].replace(',', '').replace(':', 'NaN'))
df_psk.loc[len(df_psk)] = [df_road.columns[col],
df_road.iloc[row, 0],
rail_val,
road_val,
rail_val+road_val,
road_val/(rail_val+road_val)*100,
rail_val/(rail_val+road_val)*100
]
# Show dataframe
df_psk
| Year | Country | Rail_psk | Road_psk | Total_psk | Percentage road | Percentage rail | |
|---|---|---|---|---|---|---|---|
| 0 | 2013 | Bulgaria | 1821.0 | 8666.6 | 10487.6 | 82.636638 | 17.363362 |
| 1 | 2014 | Bulgaria | 1698.0 | 9919.7 | 11617.7 | 85.384370 | 14.615630 |
| 2 | 2015 | Bulgaria | 1549.0 | 10017.8 | 11566.8 | 86.608224 | 13.391776 |
| 3 | 2016 | Bulgaria | 1455.0 | 9548.6 | 11003.6 | 86.777055 | 13.222945 |
| 4 | 2017 | Bulgaria | 1434.0 | 8916.9 | 10350.9 | 86.146132 | 13.853868 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 103 | 2017 | Finland | 4271.0 | 76002.0 | 80273.0 | 94.679407 | 5.320593 |
| 104 | 2018 | Finland | 4535.0 | 75961.0 | 80496.0 | 94.366180 | 5.633820 |
| 105 | 2019 | Finland | 4924.0 | 75861.0 | 80785.0 | 93.904809 | 6.095191 |
| 106 | 2020 | Finland | 2820.0 | 72061.0 | 74881.0 | 96.234025 | 3.765975 |
| 107 | 2021 | Finland | 2903.0 | 70760.0 | 73663.0 | 96.059080 | 3.940920 |
108 rows × 7 columns
This next block of code creates a map of Europe using plotly's choropleth. In this particular instance, it plots transport volumes relative to GDP. Most of this code can be reused for plotting other data sets using choropleth.
# Read in prepared data set, rename a column and extract the different years
df = pd.read_csv('data/pg_GDP_map.csv')
df = df.rename(columns={'pg_GDP': 'index'})
years = df.year.unique()
# Calculate max value for continuous colour range
max_pg_gdp = df['index'].max()
# Country names to be used in the labeling
customdata = df['country']
# Create choropleth figure
fig = px.choropleth(df,
locations='iso_alpha', # Iso-alpha-3 codes to signify which country
color='index',
hover_name='country',
color_continuous_scale=[[0, 'gray'], [0.01, 'gray'], [0.01, 'blue'], [1, 'red']], # Gray represents countries with no available data.
projection='miller',
range_color=(0,max_pg_gdp),
scope='europe',
animation_frame='year',
animation_group='country'
)
# Sets hover data
fig.update_traces(customdata=customdata,
hovertemplate= np.select([df['index'] == 0], ['<b>Country: </b> %{customdata}<br><br>No data available'], '<b>Country: </b>%{customdata}<br><br><b>Index value: </b>%{z}'))
# Gets value for the entire EU
numb = df.query('country=="EU"')['index']
# Get current year
current_year = fig.layout.sliders[0].steps[0].label
# Set title and EU index and legend position
fig.update_layout(title='<b>Volume of passengers relative to GDP in ' + current_year + ' (2015 base year)</b><br>EU index: ' + str(numb.tolist()[0]),
coloraxis_colorbar=dict(x=0,
y=0.5)
)
# Set figure size
fig.update_layout(width=1000,
height=600)
fig.show()
A high volume of passenger transport relative to GDP often indicates a robust and growing economy. When the volume of passenger transport is significant in comparison to the GDP, it suggests that the transportation sector is making a substantial contribution to economic growth.
A high ratio may suggest that a country's economy relies heavily on transportation services, such as tourism, logistics, or commuting. In this case, fluctuations in transportation services may have a substantial impact on the overall economic health of the country.
A low ratio relative to GDP might suggest that a country or region's transportation system is highly efficient. In such cases, the transportation sector doesn't need to consume a significant portion of the economic output to move people efficiently, indicating well-developed and cost-effective transport infrastructure.
1. Positive Trend: From 2010 to 2015, the ratio increased from 103.7 to 100, where 2015 was defined as 100. This indicates a relative increase in the volume of passenger transport in relation to GDP during this period. The passenger transport sector was playing an increasingly significant role in the European economy. This have been due to factors such as increased tourism, or infrastructure development and investment.
2. Stability: From 2016 to 2018, the ratio remained relatively stable, hovering around 100. This suggests that the passenger transport sector was maintaining its level of contribution to the economy, with no significant shifts in either direction.
3. Decline: From 2019 onwards, there was a notable decline. By 2022, the ratio had decreased to 80.2. This decline indicates a reduction in the relative contribution of the passenger transport sector to the GDP. This reduction could be influenced by various factors, such as changes in travel behavior (e.g.due to the COVID-19 pandemic), economic shifts, or increased efficiency in the transportation sector. No data augmentation after the pandemicmay because many people adapted to remote work and virtual meetings. As more people work from home, the need for daily commuting and business travel may decrease, impacting the passenger transport sector. Many businesses in the travel and tourism sectors suffered financial losses.Reduced business travel and tourism can lead to lower passenger transport relative to GDP.
Next, we look at the percentage of total passenger kilometres that is road (buses and coaches) and rail kilometres. We do this by plotting the dataframe created in part II as a bar chart with animation frames for the years.
# Remove Denmark as it has no entries for road_psk
df_psk_bar = df_psk.drop(df_psk[df_psk['Country'] == 'Denmark'].index)
# Create bar chart with animation frames for years
fig_bar = px.bar(df_psk_bar,
x='Country',
y=['Percentage road', 'Percentage rail'],
animation_frame='Year',
animation_group='Country'
)
# Set bar chart layout
fig_bar.update_layout(title='Percentage road and rail passenger km in millions',
yaxis_title='Percentage',
legend_title_text='',
)
# Set hovertemplate
fig_bar.update_traces(hovertemplate='<b>Country:</b> %{x}<br>'
'<b>Percentage:</b> %{y:,.1f}'
'<extra></extra>')
# Show bar chart
fig_bar.show()
To more easily compare the road and rail passenger kilometres for a singular country, we create a bar chart for a user supplied country below:
# Get the countries which are available and ask user for one of these
available_countries = ', '.join(df_psk_bar['Country'].unique())
chosen_country = input('Choose a country from: ' + available_countries)
# Filter dataframe on chosen country
df_psk_bar_country = df_psk_bar[df_psk_bar['Country'] == chosen_country]
# Create bar chart with animation frames for years
fig_bar = px.bar(df_psk_bar_country,
x='Year',
y=['Percentage road', 'Percentage rail'],
)
# Set bar chart layout
fig_bar.update_layout(title='Percentage road and rail passenger km in millions in ' + chosen_country,
yaxis_title='Percentage',
legend_title_text='',
)
# Set hovertemplate
fig_bar.update_traces(hovertemplate='<b>Year:</b> %{x}<br>'
'<b>Percentage:</b> %{y:,.1f}'
'<extra></extra>')
# Show bar chart
fig_bar.show()
Choose a country from: Bulgaria, Czechia, Germany, France, Croatia, Italy, Lithuania, Austria, Poland, Slovakia, FinlandFrance
From the above output images, the overall conclusion can be summarized as follows:
Analysis:
Analyzing the above trends for each country, we observe the following trends and changes:
Large economies such as Germany and France show a high percentage of road travel, probably due to their highly developed road network and vehicle penetration. Meanwhile, the Nordic country Finland, shows a relatively higher percentage of road trips, which may be related to its dispersed population and people's preference for road travel. But Slovakia, shows a relatively low percenatage of road travel, which may be due to the fact that Slovakia has a well-developed railway infrastructure and an extensive railway network, which makes people more willing to travel by rail.
Overall, we notice that eastern European countries make more use of buses and coaches compared to western European countries. Of course, this does depend on multiple factors.
2013-2021 can be broadly categorised into three periods based on the above trends in each country:
Analysis:
Investment in road construction during the current year shows an inverse relationship with the mortality rate in HIC, while it shows a direct relationship in LIC. This is an interesting phenomenon, and verified our hypothesis in RQ. Thus, in addition to the improvements of accessibility and territorial cohesion that road investments bring to HICs, they contribute to reduction of the mortality rate. This benefcial efect on road safety may be due to the great improvement brought about by the construction of new high capacity roads and the dualling of two-lane roads.
Regarding maintenance expenditure, the results for HICs and LICs coincide in terms of the reductive efect—though greater in LIC than in HIC, as well as in the level of signifcance of the variable overall. This indicates that an increase in investment in road maintenance will increase the number of fatalities and injuries. This indirectly reflects the shortcomings of maintenance policies and the lack of consideration for road safety.
Variables & Factors:
Choose the data of fatalities, injurie s& killed as two important dependent variables. Then choose other data including road investment, maintanace, motorway proportion, motorization rate, alcohol consumption, unemployment rate, GDP per capita as seven factors that could affect the changes of variables.
Method:
These data is likely leading to new statements or locksdowns and the data possibility correlates. The relation will therefore be explored by correlation analysis.
# Create dictionaries to store each countries of variables and factors
variable = {}
factor = {}
safety = {}
# Creating new dataframes using a loop
for country in countries:
# Set the data of fatalities, injuries & killed as variables
var1 = fat_inh[fat_inh['GEO'] == country]
var2 = inj_kill[inj_kill['GEO'] == country]
# Set new dataframe to save these two variables name as country
variable_country = pd.DataFrame()
for i, var in enumerate([var1,var2]):
# The absence of data from that country in the original data may occur
if not var.empty:
selected_variable = var.transpose()
variable_country = pd.concat([variable_country, selected_variable], axis=1)
else:
empty_index = ['GEO'] + [str(year) for year in range(2010, 2020)]
var = pd.DataFrame(np.nan, index = empty_index, columns =[0])
variable_country = pd.concat([variable_country,var],axis=1)
# Rename the columns as their original name
variable_country.columns = ['Road Fatalities',
'Injuries & Killed']
# Add the variable values of this country to the dictionary
variable[country] = variable_country
# To distinguish and abbreviate, renaming dataframe into factors 1-7
for idx,df in enumerate([road_inv, road_maint, df_mot_rate, df_mot_index, df_alcohol, df_unempl_rate, GDP],start=1):
new_df = df[df['GEO'] == country]
globals()[f'fac{idx}'] = new_df
# Set new dataframe to save these two variables name as country
factor_country = pd.DataFrame()
for i, fac in enumerate([fac1,fac2,fac3,fac4,fac5,fac6,fac7]):
if not fac.empty:
selected_factor = fac.transpose()
factor_country = pd.concat([factor_country, selected_factor], axis=1)
else:
empty_index = ['GEO'] + [str(year) for year in range(2010, 2020)]
fac = pd.DataFrame(np.nan, index = empty_index, columns = [0])
factor_country = pd.concat([factor_country, fac], axis=1)
# Rename the columns as their original name
factor_country.columns = ['Road Investment',
'Road Maintainance',
'Motorway proportion',
'Motorization',
'Alcohol Consumption',
'Unemployment rate',
'GDP per capita']
# Add the factors values of this country to the dictionary
factor[country] = factor_country
# Merge the dataframes of variable and factor into df safety and delete unnecessary lines
safety_country = pd.concat([variable_country,factor_country],axis=1)
safety_country = safety_country.iloc[1:]
safety[country] = safety_country.replace(',', '', regex=True).astype(float)
safety[country] = safety_country.dropna(axis=1)
# Unpack the dictionary
# Use a loop to read each DataFrame one by one in the dictionary
df_safety = []
for country, df in safety.items():
# Set name of each countries' dataframe
globals()[f"df_safety_{country}"] = df
df_safety.append(df)
| Road Fatalities | Injuries & Killed | Road Investment | Motorization | Alcohol Consumption | Unemployment rate | GDP per capita | |
|---|---|---|---|---|---|---|---|
| 2010 | 4.5 | 374,818 | 12,250,000,000 | 527.0 | 11.58 | 7.0 | 41,572.50 |
| 2011 | 5 | 396,374 | 12,290,000,000 | 534.0 | 11.33 | 5.8 | 46,705.90 |
| 2012 | 4.5 | 387,978 | 11,120,000,000 | 539.0 | 11.32 | 5.4 | 43,855.90 |
| 2013 | 4.1 | 377,481 | 11,390,000,000 | 543.0 | 11.08 | 5.2 | 46,298.90 |
| 2014 | 4.2 | 392,912 | 11,810,000,000 | 547.0 | 11.07 | 5.0 | 48,023.90 |
| 2015 | 4.2 | 396,891 | 11,430,000,000 | 548.0 | 11.06 | 4.6 | 41,103.30 |
| 2016 | 3.9 | 399,872 | 12,090,000,000 | 555.0 | 10.96 | 4.1 | 42,136.10 |
| 2017 | 3.8 | 393,492 | 13,520,000,000 | 561.0 | 10.86 | 3.8 | 44,652.60 |
| 2018 | 4 | 399,293 | 15,810,000,000 | 567.0 | 10.88 | 3.4 | 47,939.30 |
| 2019 | 3.7 | 387,276 | 16,750,000,000 | 574.0 | 10.56 | 3.1 | 46,793.70 |
# Using Poland as a representative of low-income countries
df_Finland = safety['Finland']
# Compute the Pearson's correlation coefficient
df_Finland = df_Finland.replace(',', '',regex=True).astype(float)
df_Finland_hm = df_Finland.corr(method = 'pearson')
# Make a heatmap to display the results
sns.heatmap(df_Finland_hm, annot = True)
plt.title('Pearson\'s correlation coefficient - Finland')
Text(0.5, 1.0, "Pearson's correlation coefficient - Finland")
# Using Poland as a representative of low-income countries
df_Poland = safety['Poland']
# Compute the Pearson's correlation coefficient
df_Poland = df_Poland.replace(',', '',regex=True).astype(float)
df_Poland_hm = df_Poland.corr(method = 'pearson')
# Make a heatmap to display the results
sns.heatmap(df_Poland_hm, annot = True)
plt.title('Pearson\'s correlation coefficient - Poland')
Text(0.5, 1.0, "Pearson's correlation coefficient - Poland")
From the heatmap above, several strong correlations can be seen. And the different groups High income(HIC) and low income(LIC) presents different patterns. In the group of HIC, represented by Finland. In the group of LIC, represented by Poland.
The motorway proportion has less correlations with safety in HIC. But the LIC shows strong instead. This might because the construction of motorway has stabilized over these years in HIC. But with different situation, the countries in LIC is still in the period of increasing construction, and the annual construction mileage shows more higher, it is likely lead to high mortality rate.
The coeficient between road fatalities, injuries&killed and the motorization are -0.81 and -0.98 respectively, which is very high. It can be inferred that when the motorization rate goes higher, the number of casualties will decrease. That is, they are anti correlated. LIC shows less correlations, this might because of the proposal of environmental law or some standard for motor vehicle ownership. The increase in residents' motor vehicles is decreasing year by year and tends to stabilize.
In HIC, the correlationship between alcohol consumption and injuries is up to 0.95, the most significant among these 7 factors. And so as the LIC, they are all positively correlated. Although the data were available in the World Bank database, interpretations derived from these data would be questionable, as they refer to the total population (over 15 years) and do not necessarily represent the attitudes of drivers regarding alcohol consumption.
While in the aspect of unemployment rate, the HIC did not result significant.But in LIC, the correlation is stronger. Just as we assumed: although the unemployment rate may be related to road safety, it is only a potential factor, and the actual impact may be mediated and influenced by other factors.
GDP also has correlation with road safety in LIC, but shows weak in HIC. The beneficial effect of GDP on road safety may be due to the impact that greater GDP has improvement of the health care system and on the existence of betterpassive protection measures both in vehicles and infrastructure itself, such effects proving valid at least in LIC.
This section shows the data used in the Streamlit application and contains a link to a working version.
The Streamlit application can be found here. The files on GitHub can be found under the Streamlit folder.
All the data used for the Streamlit application was obtained on OECD Stat. Each dataset retrieved is for the time period 1995-2021 which was the largest time period possible for all the datasets.
The datasets contain the following information: Country, year, population, gdp and government spending in millions of local currencies, road investment and maintenance costs in local currency, the amount of road passenger km in millions and the amount of road injuries & deaths combined.
In order to have all this data on the same countries, we first find the set of countries which every dataset contains. To find the countries in a dataset, we use the following function:
# Creates a set of all available countries in a file
def check_available_countries(file):
df = pd.read_csv(file)
country_set = set()
for country in df['Country']:
country_set.add(country)
return country_set
We then use this function to get the sets of countries for each dataset. This allows us to find the intersection between the sets and thus the available countries for all datasets.
# Get the available countries for each file
countries_road_fat = check_available_countries('streamlit/data/road_injuries_deaths.csv')
countries_road_inv = check_available_countries('streamlit/data/road_investment.csv')
countries_road_maint = check_available_countries('streamlit/data/road_maintenance.csv')
countries_road_pg = check_available_countries('streamlit/data/road_passengers.csv')
countries_pop = check_available_countries('streamlit/data/population.csv')
countries_gdp = check_available_countries('streamlit/data/gdp_government_spending.csv')
# Find the available countries in all files (the intersection)
available_countries = set.intersection(countries_road_fat, countries_road_inv, countries_road_maint, countries_road_pg, countries_pop, countries_gdp)
Now that we have the available countries, we remove any countries not in this set from all dataframes using the below function.
# This function removes countries not in the available countries set
def filter_available_countries(file):
df = pd.read_csv(file)
# Rename the time column to year as it contains years
df.rename(columns={'Time': 'Year'}, inplace=True)
df = df[df['Country'].isin(available_countries)]
df = df['Year'].drop_duplicates().to_frame().merge(df['Country'].drop_duplicates(), how="cross").merge(df, how="left")
# Store filtered file with the same name prepended with fil_
df.to_csv(file.split('/')[0] + '/fil_' + file.split('/')[1], index=False)
# Filter the files to only contain the available countries
filter_available_countries('streamlit/data/road_injuries_deaths.csv')
filter_available_countries('streamlit/data/road_investment.csv')
filter_available_countries('streamlit/data/road_maintenance.csv')
filter_available_countries('streamlit/data/road_passengers.csv')
filter_available_countries('streamlit/data/population.csv')
filter_available_countries('streamlit/data/gdp_government_spending.csv')
We now have datasets all containing the same countries and time period (were obtained using the same time period).
Next, we create required data for the analysis in the Streamlit application using the filtered datasets. In order to combine columns, the data needs to line up. Therefore, each dataframe is sorted on countries and years, and the index is reset.
# Sorts dataframes based upon country names alphabetically and years chronologically
def sort_countries(df):
return df.sort_values(by=['Country', 'Year'])
To be able to compare countries, we look at injuries and passenger kilometres relative to a countries population. In the code below, this data is combined and stored.
# Read in the required data
df_pop = pd.read_csv('streamlit/data/fil_population.csv')
df_inj = pd.read_csv('streamlit/data/fil_road_injuries_deaths.csv')
df_pas = pd.read_csv('streamlit/data/fil_road_passengers.csv')
# Sort the countries and years for the dataframes and reset the indices such that they are the same across dataframes
df_pop = sort_countries(df_pop)
df_pop.reset_index(drop=True, inplace=True)
df_inj = sort_countries(df_inj)
df_inj.reset_index(drop=True, inplace=True)
df_pas = sort_countries(df_pas)
df_pas.reset_index(drop=True, inplace=True)
# Create a new dataframe of the data that we require for the Streamlit application
df = pd.DataFrame(data={'Year': df_inj['Year'],
'Country': df_inj['Country'],
'Location': df_inj['COUNTRY'],
'Injuries': df_inj['Value'],
'Population': df_pop['Value'],
'Passenger_kilometres': df_pas['Value'],
'Injuries_passenger_kilometres': df_inj['Value'] / df_pas['Value'] * 1000,
'Percentage_inj_pop': df_inj['Value'] / df_pop['Population'] * 100,
'Percentage_inj_pk_pop': (df_inj['Value'] / df_pas['Value'] * 1000) / df_pop['Value'] * 100
})
# Save the created dataframe for use in the Streamlit application
df.to_csv('streamlit/data/app_data/road_safety.csv', index=False)
To be able to compare countries, we look at road expenditures as percentage of the total expenditures as we only have expenditures in local currencies. In the code below, this data is combined and stored.
# Read in the required data
df_gdp_gs = pd.read_csv('streamlit/data/fil_gdp_government_spending.csv')
df_rm = pd.read_csv('streamlit/data/fil_road_maintenance.csv')
df_ri = pd.read_csv('streamlit/data/fil_road_investment.csv')
# Split the combined GDP and government spending dataframe into two
df_gdp = df_gdp_gs[df_gdp_gs['STATISTICS'] == 'GDP']
df_gs = df_gdp_gs[df_gdp_gs['STATISTICS'] == 'T_PUB_EXP'].copy()
# Fill in missing data for Turkey in the government spending dataframe
for i in range(1995, 2000, 1):
df_gs.loc[-i] = [i, 'Türkiye', 'TUR', 'T_PUB_EXP', 'Total Government Expenditure (mln)', 'LOCAL_CUR', 'Local currency', i, np.NaN, np.NaN, np.NaN]
# Sort the countries and years for the dataframes and reset the indices such that they are the same across dataframes
df_gdp = sort_countries(df_gdp)
df_gdp.reset_index(drop=True, inplace=True)
df_rm = sort_countries(df_rm)
df_rm.reset_index(drop=True, inplace=True)
df_ri = sort_countries(df_ri)
df_ri.reset_index(drop=True, inplace=True)
df_gs = sort_countries(df_gs)
df_gs.reset_index(drop=True, inplace=True)
# Create a new dataframe of the data that we require for the Streamlit application
df = pd.DataFrame(data={'Year': df_gdp['Year'],
'Country': df_gdp['Country'],
'Location': df_gdp['LOCATION'],
'GDP': df_gdp['Value']*1000000, # GDP is in millions
'Gov_Spending': df_gs['Value']*1000000, # Government spending is in millions
'Maintenance': df_rm['Value'],
'Investments': df_ri['Value']
})
# Create percentage of total government spending columns for road maintenance and investments
df['Perc_Maintenance'] = (df['Maintenance']/df['Gov_Spending'])*100
df['Perc_Investments'] = (df['Investments']/df['Gov_Spending'])*100
# Save the created dataframe for use in the Streamlit application
df.to_csv('streamlit/data/app_data/road_expenditures.csv', index=False)
Here we show the two created datasets: (They are also available on the Streamlit application on the Data page)
df_saf = pd.read_csv('streamlit/data/app_data/road_safety.csv')
df_saf
| Year | Country | Location | Injuries | Population | Passenger_kilometres | Injuries_passenger_kilometres | Percentage_inj_pop | Percentage_inj_pk_pop | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1995 | Belgium | BEL | 73568.0 | 10136814.0 | 104228.0 | 705.837203 | 0.725751 | 0.006963 |
| 1 | 1996 | Belgium | BEL | 68259.0 | 10156641.0 | 104096.0 | 655.731248 | 0.672063 | 0.006456 |
| 2 | 1997 | Belgium | BEL | 70907.0 | 10181246.0 | 107343.0 | 660.564732 | 0.696447 | 0.006488 |
| 3 | 1998 | Belgium | BEL | 72260.0 | 10203012.0 | 112960.0 | 639.695467 | 0.708222 | 0.006270 |
| 4 | 1999 | Belgium | BEL | 72543.0 | 10226419.0 | 116460.0 | 622.900567 | 0.709369 | 0.006091 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 697 | 2017 | United States | USA | 2783000.0 | 325122128.0 | 6558301.0 | 424.347708 | 0.855986 | 0.000131 |
| 698 | 2018 | United States | USA | 2747000.0 | 326838199.0 | NaN | NaN | 0.840477 | NaN |
| 699 | 2019 | United States | NaN | NaN | 328329953.0 | NaN | NaN | NaN | NaN |
| 700 | 2020 | United States | NaN | NaN | 331511512.0 | NaN | NaN | NaN | NaN |
| 701 | 2021 | United States | NaN | NaN | 332031554.0 | NaN | NaN | NaN | NaN |
702 rows × 9 columns
df_exp = pd.read_csv('streamlit/data/app_data/road_expenditures.csv')
df_exp
| Year | Country | Location | GDP | Gov_Spending | Maintenance | Investments | Perc_Maintenance | Perc_Investments | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1995 | Belgium | BEL | 2.104891e+11 | 1.108116e+11 | 5.800000e+07 | 1.600000e+08 | 0.052341 | 0.144389 |
| 1 | 1996 | Belgium | BEL | 2.142871e+11 | 1.136839e+11 | 6.300000e+07 | 1.590000e+08 | 0.055417 | 0.139861 |
| 2 | 1997 | Belgium | BEL | 2.241015e+11 | 1.156402e+11 | 6.000000e+07 | 1.530000e+08 | 0.051885 | 0.132307 |
| 3 | 1998 | Belgium | BEL | 2.326238e+11 | 1.185435e+11 | 5.200000e+07 | 1.520000e+08 | 0.043866 | 0.128223 |
| 4 | 1999 | Belgium | BEL | 2.423076e+11 | 1.223574e+11 | 5.800000e+07 | 1.610000e+08 | 0.047402 | 0.131582 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 697 | 2017 | United States | USA | 1.947734e+13 | 7.412759e+12 | 5.144224e+10 | 9.660000e+10 | 0.693969 | 1.303159 |
| 698 | 2018 | United States | USA | 2.053306e+13 | 7.794634e+12 | 6.063702e+10 | 1.023000e+11 | 0.777933 | 1.312441 |
| 699 | 2019 | United States | USA | 2.138098e+13 | 8.221634e+12 | 6.150300e+10 | 1.077000e+11 | 0.748063 | 1.309959 |
| 700 | 2020 | United States | USA | 2.106047e+13 | 9.962874e+12 | 6.134600e+10 | 1.063000e+11 | 0.615746 | 1.066961 |
| 701 | 2021 | United States | USA | 2.331508e+13 | 1.047725e+13 | NaN | 1.086000e+11 | NaN | 1.036532 |
702 rows × 9 columns
The Streamlit application states on the correlation page that for some countries there is a correlation between money spend on road infrastructure and the amount of road injuries & deaths. However, it also shows that when looking at the road expenditures relative to the total expenditures, this correlation does not hold. We have to conclude that our hypothesis has been proven wrong.
The correlation that seems to hold for most countries is that for more recent years, the amount of injuries & deaths relative to road passenger kilometres declines. This could be the result of innovation in car technology, like automatic brakes, or changes to road infrastructure that have been a source of many accidents.